Week 02: Lexical Analysis & Finite Automata

Lexical Analysis

Token Class (or Class)

Classify program substrings according to role (token class), class corresponding to sets of strings:

Identifier : string of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: else, if, begin, ...
Whitespace: a non-empty sequence of blanks, newlines, tabs

Goal of Lexical Analysis

Definition
- lexeme: A lexeme is a sequence of characters that matches the pattern for a token.
- token: A token is a pair consisting of the token name and the value.
Concept
- Parttition the input string into lexemes.
- Identity the token of each lexeme.
- Communicate tokens to the parser.
Input: Program Substrings
Output: Tokens
Sample Input: string (foo=42)
Sample Output: <class, "string">, <'('>, <ID, "foo">, <Operator, "=">, <"Int", "42>, <')'>
Remark
- Left-to-Right scan => lookahead required.

Regular Languages

Regular Expression

Concept: Regular expressions (syntax) specify regular languages (set of strings).
Two base cases
- Single Character: $'c' = \{ "c" \}$
- Empty String: $\epsilon = \{ "" \}$
3 compound expressions
- Union: $A + B = \{a\ |\ a \in A\} \cup \{ b\ |\ b \in B \}$
- Concatenation: $AB = \{ab\ |\ a \in A \wedge b \in B \}$
- Iteration: $ A^* = \bigcup_{i \geq 0} A^i, \begin{cases} A^i = A...A \\ A^0 = \epsilon \end{cases} $

Def. The regular expression over $\Sigma$ are the smallest set of expressions including: $$ \begin{align} R &= \epsilon \\ &|\ 'c', c\in\Sigma\\ &|\ R+R\\ &|\ RR\\ &|\ R^* \end{align} $$

Formal Languages

Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$. Meaning function $L$ maps regular expressions to set of strings.

$$ \begin{align} L(\epsilon) &= \{ "" \} \\ L('c') &= \{ "c" \} \\ L(A+B) &= L(A) \cup L(B) \\ L(AB) &= \{ab\ |\ a \in L(A) \wedge b \in L(B)\} \\ L(A^*) &= \bigcup_{i \geq 0} L(A)^i \\ \end{align} $$

Q:Why use a meaning function?

Make clear what is syntax, what is semantics.
Allow us to consider notation as a separate issue.
Because expression and meanings are not 1-1.
- Meaning is many to one.

Lexical Specifications

Lexemes

Keyword: "if" / "else" / "then" / ...
- $ 'if' + 'else' + 'then' + ...$
Integer: a non-empty string of digits
- $digits: '0' + '1' + '2' + ...$
- $(digit)(digit)^* = digit^+$
Identifier: strings of letters or digits, starting with a letter
- $letter = 'a' + 'b' + 'c' + ... = [a-zA-Z]$
- $(letter)(digit+letter)^*$
Whitespace: a non-empty sequence of blanks, newlines, and tabs
- $('\;' + '\setminus n' + '\setminus t')^+$

More regular expressions

At least one: $A^+ = AA^*$
Union: $A|B = A+B$
Option: $A? = A+\epsilon$
Range: $'a'+'b'+\dots+'z' = [a-z]$
Excluded Range: $\widehat{[a-z]} = [\wedge a-z]$

How do we check $program \in L(R)$?

Write a regular expression for the lexemes of each token class
- Number $= digit^+$
- Keyword $= 'if'+'else'+\dots$
- Identifiew $= letter(letter+digit)^*$
- OpenPar $='('$
- ...
Construct $R$ to match all lexemes.
- $R = Number + Keyword + Number + ... = R_1 + R_2 + ...$
Let input be $x_1...x_n$. Find the longest length $i$ such that $x_1...x_i \in L(R)$.
Remove $x_1...x_i$. Go to step 3.
If $x_1...x_i \in L(R_a) \cap L(R_b)$, apply $R_{\min(a,b)}$.